Abstract


JTLA 


The study evaluated the comparability of two versions of a certification test: a paper-
and-pencil test (PPT) and a computer-based test (CBT). An effect size measure known as
Cohen’s d and differential item functioning (DIF) analyses were used as measures of
comparability at the test and item levels, respectively. Results indicated that the effect sizes
were small (d < 0.20) and not statistically significant (p > 0.05), suggesting no substantial 
difference between the two test versions. Moreover, DIF analysis revealed that reading 
and mathematics items were comparable for both versions. However, three writing items 
were flagged for DIF. Substantive reviews failed to identify format differences that could 
explain the performance differences, so the causes of DIF could not be identified. 
Introduction

Perspectives/Theoretical Framework 

The effectiveness of achievement tests as tools that yield scores that 
can be validly interpreted regardless of the mode of delivery of these 
tests (e.g., paper and pencil vs. computer) is often questioned (American 
Educational Research Association, American Psychological Association, 
National Council on Measurement in Education, 1999). For example, 
scores derived from computer-based tests (CBT) as compared to paper- 
and-pencil tests (PPT) might reflect not only the examinee’s proficiency in 
the construct being measured but also the level of computer proficiency. 
This affects the construct being measured and disrupts the comparison 
and interpretation of test scores across the two modes of administration. 

Many testing programs are increasingly administering the same test 
in both PPT and CBT formats. For example, the TOEFL® program concur- 
rently delivers PPTs and CBTs in approximately 228 countries every year. 
Similarly, the GRE® General test is offered in the United States, Canada, 
and many other countries in paper-and-pencil and computer-based
formats. More recently, the PRAXIS™ program started administering 
the Pre-Professional Skills Tests (PPST®) via CBT in addition to PPT. 
Mills, Potenza, Fremer, and Ward (2002) speculated that this trend will 
continue to grow because of an increase in availability of microcomputers 
in educational settings, a substantial improvement in the speed of com- 
puters, and a significant reduction in cost. 
CBTs have many advantages over PPTs, which may include faster score
reporting, savings on paper and personnel resources and costs of scoring
services (Wise & Plake, 1990), and development of new methods of assessment,
ranging from simple adaptations of multiple-choice items to more innovative
item types (Jodoin, 2003). Despite these advantages, an important
question that arises when tests are administered in both formats is whether
or not the scores produced are interchangeable (Gallagher, Bridgeman, &
Cahalan, 2002). For example, scores derived from CBTs as compared to
PPTs might reflect not only the examinee’s proficiency on the construct
being measured but also differences in formatting (including typing versus
handwriting) and/or computer proficiency.

There is a large body of research that documents the comparability of 
scores obtained from PPTs and CBTs. Mazzeo and Harvey (1988) in their 
review of studies comparing PPTs and CBTs indicated that CBTs tended 
to be more difficult than PPT versions of the same tests. Similarly, a 
meta-analysis conducted by Mead and Drasgow (1993) suggested that the 
constructs being measured across the two modes were similar for power 
tests but not for speeded tests. Finally, a more recent study by Gallagher 
et al. (2002) found that performance across PPT and CBT versions of tests 
differed for subgroups based on gender and ethnicity. For example, 
female test takers performed poorly on CBTs whereas African-American 
and Hispanic test takers benefited from this format. However, other 
researchers have found that PPTs and their CBT counterparts yield com- 
parable scores. For example, Taylor, Jamieson, Eignor, and Kirsch (1998) 
studied the comparability of PPTs and CBTs for the 1996 administration 
of the TOEFL and found no meaningful difference in performance for 
examinees taking the two different versions. Similarly, Wise, Barnes, 
Harvey, and Plake (1989) contended that PPT and CBT versions of achieve- 
ment tests yield very similar scores. 

Since the results of these studies are inconsistent and the use of computers
has become commonplace, it is even more important to examine
whether scores obtained from the two different modes of delivery are in 
fact comparable (Gallagher et al., 2002). In addition, the bulk of these 
studies have focused on mean differences in test performance across PPTs 
versus CBTs. However, in order to gain more precise understanding about 
the nature of differences between modes, item-level performance (e.g., dif- 
ferential item functioning or DIF) should also be assessed. For example, 
items based on longer reading passages displayed on multiple screens in 
CBT as compared to the same passages presented on a single page in PPTs 
may lead to differential performance on these items (Thompson, Thurlow, 
& Moore, 2002). Also, in the case of constructed response items, it has 
been found that raters are more lenient toward handwritten essays as 
compared to computer-typed essays. One reason for this occurrence, found
by Arnold et al. (1990), was that raters gave students the benefit of the 
doubt in situations where the handwriting became difficult to read. 

Since fairness is an important concern in the field of educational 
measurement, it is important to ensure that scores obtained from both 
PPTs and CBTs are comparable and thus measure the same construct. 
Hence, the purpose of this study is to compare the performance of exam- 
inees who took the PPT version of a certification test with another group 
that took the same form in CBT. It should be noted that there was an 
overall difference between items in PPT and CBT formats for the full test. 
The items in PPT format had the multiple-choice options clearly marked
as A, B, C, D, or no error. However, in the CBT format, these options were 
not marked as A, B, C, D, or no error. For some items in CBT format, where 
the task was to identify an error in a written passage, these options were 
presented as underlined texts in the passages and the examinees were 
instructed to click on the underlined text that denoted the correct option. 
For the remaining items, oval icons were provided beside the options and 
the examinees were instructed to click on the oval icon that denoted the 
correct option (Appendix A shows hypothetical examples of
this difference in PPT and CBT formats). This difference was assumed to 
have no effect on the examinees' ability to respond to an item in the PPT or 
CBT formats. In the United States, these tests hold extremely high stakes 
because test-takers who do not pass these tests are not eligible to enroll in 
teaching programs and then to teach in those states that require a passing 
score on these tests. Therefore, it becomes especially important to ensure 
that these tests do not unfairly favor one group of test-takers over another
based on whether they took the test in PPT or CBT format. 
Method

Certification Test and Sample 

This study used test data collected in a 2003 administration of a large- 
scale certification test from the Praxis™ program. This test measures basic 
proficiency in reading, writing, and mathematics and is used for entrance 
into teaching programs. The paper-and-pencil and computerized versions 
of the reading and mathematics tests have 40 items each and the writing 
test has 45 items. In addition to the number of items stated above, the 
computerized version of the reading, writing, and mathematics tests has 
approximately five additional items that are pre-test items (used in the 
assembly of future forms) and are not used in the final scoring. The paper- 
based tests in reading and mathematics are each 60-minute multiple- 
choice tests. The writing test includes a 30-minute multiple-choice section 
and a 30-minute essay section. The computer-based tests in reading and 
mathematics are each 75-minute multiple-choice tests. The writing test 
includes a 45-minute multiple-choice section and a 30-minute essay section.
The increase in time for the reading and mathematics sections and
the multiple-choice section of the writing test in the computer-based tests
is due to an increase in the total number of items resulting from the additional
pre-test items and also to allow for tutorials and the collection of
background information from test-takers.

Six groups of examinees classified by mode of administration (i.e., 
PPT vs. CBT) and content area (i.e., reading, writing, and mathematics) 
were analyzed. It should be noted that the examinees were free to choose 
between either a PPT or CBT version of the test and therefore, there was 
no random assignment of examinees to either version of these tests. This 
is important to note because performance differences found in PPT and 
CBT versions of tests may be due to actual ability differences in the test- 
taking populations (i.e., test impact) rather than differences in testing 
format (PPT vs. CBT). 

The sample sizes differed slightly for the analyses conducted at test and 
item levels. At the test level, there were 1,122 examinees in reading, 1,050 
examinees in writing, and 1,136 examinees in mathematics for the CBTs. 
An equal number of examinees were used for the PPTs in reading, writing, 
and mathematics, respectively. These examinees were sampled from a 
larger population in the PPT group based on propensity score matching (a 
more detailed discussion on propensity score matching will follow in the 
analytical procedure section). For the item-level analysis, there were 1,122 
examinees in reading, 1,050 examinees in writing, and 1,136 examinees 
in mathematics for the CBTs. For the PPTs, the study used two random 
sub-samples of 2,000 examinees from the larger population in reading, 
writing, and mathematics, respectively. Thus, the item-level analysis
involved two separate runs. For the first run, all available examinees for 
the CBTs and a random sub-sample of 2,000 examinees for the PPTs were 
used. In the second run, the exact same analysis was replicated with the 
second random sub-sample of 2,000 examinees for the PPTs. 

This study was broken into three steps: 

1. The standardized mean difference across the mode (i.e., PPT 
vs. CBT) of the test was evaluated in order to identify the 
overall difference in performance at the test level. 

2. Mode DIF analyses were conducted for the PPT and CBT 
versions of the tests to identify items that may function 
differentially across the two modes at the item level. 

3. Substantive analysis of the items flagged as DIF was 
conducted by test reviewers in order to identify the 
sources of mode DIF. 

Analytical Procedure 

At the test level, comparability of PPTs and CBTs was evaluated using 
an effect size measure commonly referred to as Cohen’s d (Cohen, 1988). 
Cohen’s d reports mean differences in terms of standard deviation units 
thereby establishing a common metric for comparing performance for 
examinees who took the paper-and-pencil or computerized versions of the 
tests. This effect size measure is calculated using the formula given below.

$$d = \frac{\bar{X}_{G1} - \bar{X}_{G2}}{SD_{pooled}}$$

where d is the effect size, $\bar{X}_{G1}$ is the mean of the PPT group, $\bar{X}_{G2}$ is the
mean of the CBT group, and $SD_{pooled}$ is the pooled standard deviation of
the PPT and CBT groups. According to Cohen (1992), the following guidelines
for interpreting the d statistic are meaningful: (a) small effect when d
is approximately 0.20, (b) moderate effect when d is approximately 0.50,
and (c) large effect when d is approximately 0.80. Also, all effect sizes were
tested for statistical significance by converting the d statistic into a t-value
(where $d = 2t / \sqrt{df}$) and checking whether this value was greater than
the critical t-value in the t-distribution (Rosenthal & Rosnow, 1991).
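
As a rough check on the arithmetic, the effect sizes can be recomputed from the summary statistics reported in Table 1 and the per-group sample sizes given in the sample description. The sketch below is illustrative only: the function names are hypothetical, and the difference is taken as CBT minus PPT because that order reproduces the signs reported in Table 1, which is an inference from the reported values rather than something the text states. The value 1.96 is used as an approximate two-tailed critical t for large degrees of freedom.

```python
import math

def cohens_d(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Standardized mean difference (a minus b) using the pooled SD."""
    pooled_var = ((n_a - 1) * sd_a**2 + (n_b - 1) * sd_b**2) / (n_a + n_b - 2)
    return (mean_a - mean_b) / math.sqrt(pooled_var)

def t_from_d(d, df):
    """Invert d = 2t / sqrt(df) to recover the t-value."""
    return d * math.sqrt(df) / 2

# (CBT mean, CBT SD, PPT mean, PPT SD, n per group), taken from Table 1
# and the matched sample sizes reported for the test-level analysis.
TABLE_1 = {
    "reading":     (30.597, 6.719, 30.404, 7.207, 1122),
    "writing":     (30.877, 6.911, 30.714, 7.221, 1050),
    "mathematics": (27.862, 7.502, 28.402, 7.729, 1136),
}

APPROX_CRITICAL_T = 1.96  # assumption: large-df, two-tailed, alpha = 0.05

results = {}
for test, (m_cbt, sd_cbt, m_ppt, sd_ppt, n) in TABLE_1.items():
    d = cohens_d(m_cbt, sd_cbt, n, m_ppt, sd_ppt, n)
    t = t_from_d(d, 2 * n - 2)
    results[test] = (round(d, 3), round(t, 2), abs(t) > APPROX_CRITICAL_T)
    print(test, results[test])
```

Under these assumptions, each recomputed d rounds to the value in Table 1 and none of the converted t-values exceeds the approximate critical value.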

As mentioned earlier, the examinees were free to choose between 
either a PPT or CBT version of the test, and there was no random assign- 
ment of examinees to either version of these tests. Therefore, a simple 
comparison of performance between the PPT and CBT groups at the test 
level can be misleading in that such a comparison may not reveal the effect
of mode of delivery (PPT vs. CBT) on test performance per se because
results can be confounded by other factors that lead these examinees to 
choose a particular testing format. To overcome this potential problem, 
the analysis used propensity scores (Rosenbaum & Rubin, 1983) to match 
PPT and CBT examinees on a single variable (i.e., the propensity score). 
This tended to balance any consistent differences in the distributions of 
these groups. The propensity score was estimated using a logistic regres- 
sion model where the dependent variable was testing mode (PPT or CBT) 
and the independent variables were gender, language, test repeater status, 
race, GPA, and educational level. The empirically estimated regression 
weights from the logistic regression were used to compute the propensity 
score for each examinee in the PPT and CBT groups, separately. The pro- 
pensity score was a weighted sum of the variables in the logistic regression 
and is denoted as: 

$$Y = b_1 x_1 + b_2 x_2 + b_3 x_3 + \dots + b_n x_n$$

where $x_1, x_2, x_3, \dots, x_n$ represent the examinee’s values of the selected
variables and $b_1, b_2, b_3, \dots, b_n$ represent the weights determined by the
logistic regression.

After the propensity scores were calculated for each examinee in the 
CBT and PPT groups, the PPT group examinees were sampled on propen- 
sity scores to match propensity scores for the examinees in the CBT group 
thereby leading to two distributions (i.e., PPT versus CBT) that were demo- 
graphically similar. It should be noted that the effectiveness of propen- 
sity matching depends largely on the nature of the independent variables 
included in the propensity matching procedure. In this study, all existing 
background variables were used as independent variables. However, if other
variables had been available and included in this matching procedure, a
slightly different matching might have been obtained.
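
The matching step can be illustrated with a small sketch. The paper does not state which matching algorithm was used, so the greedy nearest-neighbor rule without replacement shown here is an assumption, as are the function names, the weights, and the toy covariate values. In the study, the weights came from the fitted logistic regression; here they are simply made up.

```python
def propensity(weights, covariates):
    """Weighted sum Y = b1*x1 + ... + bn*xn for one examinee."""
    return sum(b * x for b, x in zip(weights, covariates))

def match_nearest(cbt_scores, ppt_scores):
    """For each CBT score, pick the closest unused PPT score.

    Greedy matching without replacement; returns the chosen PPT indices
    in CBT order. Assumes there are at least as many PPT as CBT examinees.
    """
    available = set(range(len(ppt_scores)))
    chosen = []
    for s in cbt_scores:
        # Ties on distance are broken by the lower index.
        best = min(available, key=lambda j: (abs(ppt_scores[j] - s), j))
        available.remove(best)
        chosen.append(best)
    return chosen

# Hypothetical weights and covariates, for illustration only.
weights = [0.5, -1.0]
cbt = [[1, 0], [0, 1]]
ppt = [[0, 1], [1, 1], [1, 0], [2, 0]]

cbt_scores = [propensity(weights, x) for x in cbt]
ppt_scores = [propensity(weights, x) for x in ppt]
chosen = match_nearest(cbt_scores, ppt_scores)
print(chosen)
```

The matched PPT sample has the same size as the CBT group, which is consistent with the equal group sizes reported for the test-level analysis.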

At the item level, comparability of PPTs and CBTs was evaluated using 
the DIF detection procedure SIBTEST (Shealy & Stout, 1993) to identify 
items that function differentially for the PPT and CBT versions of the tests. 
SIBTEST provides an overall statistical test and a measure of the effect size
($\hat{\beta}_{UNI}$) for each item. In the SIBTEST framework, DIF is conceptualized as
a difference between the probabilities of selecting a correct response for
examinees with the same levels of the latent attribute of interest ($\theta$). This
difference, when found, is attributable to different amounts of nuisance
abilities ($\eta$) that influence the item response patterns.
The statistical hypothesis tested by SIBTEST is:

$$H_0: B(T) = P_R(T) - P_F(T) = 0$$

versus

$$H_1: B(T) = P_R(T) - P_F(T) \neq 0,$$

where B(T) is the difference in probability of a correct response on the
studied item for examinees in the reference (or advantaged) and focal (or
disadvantaged) groups matched on true score; $P_R(T)$ is the probability of a
correct response on the studied item for examinees in the reference group
with true score T; and $P_F(T)$ is the probability of a correct response on the
studied item for examinees in the focal group with true score T. For classifying
the SIBTEST effect sizes, guidelines first recommended by Dorans
(1989) for interpreting standardized p-values were used. Since both the
Standardization and SIBTEST methods are essentially measuring the same
construct (i.e., the total test score), using Dorans’s criteria in the SIBTEST
context seemed reasonable. Therefore, absolute values of the Beta statistic
between .000 and .050 indicate negligible DIF, between .050 and .100
indicate moderate DIF, and .100 and above indicate large DIF. Items with
large DIF are unusual and are often the ones for which test developers find 
content explanations for the DIF. Therefore they should be examined very 
carefully (Zenisky, Hambleton, & Robin, 2003-2004; Dorans & Holland, 
1993). Based on these guidelines, items are classified as “A” (negligible 
DIF), “B” (moderate DIF) or “C” (large DIF). Also, in all analyses, an alpha 
level of 0.05 was used with a non-directional hypothesis test. 
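
The classification rule above amounts to a pair of thresholds on the absolute value of the Beta statistic. A minimal sketch follows; the function name is hypothetical, and the boundary handling (values of exactly .050 or .100 fall into the higher category) is an assumption, since the text gives overlapping endpoints. The sketch covers only the effect-size classification, not the accompanying significance test.

```python
def classify_dif(beta_uni):
    """Map a SIBTEST Beta estimate to an A/B/C DIF level by absolute size."""
    size = abs(beta_uni)
    if size < 0.050:
        return "A"  # negligible DIF
    if size < 0.100:
        return "B"  # moderate DIF
    return "C"      # large DIF

for beta in (0.020, -0.086, -0.099, -0.101, -0.131):
    print(beta, classify_dif(beta))
```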

An initial concern before conducting the DIF analysis was whether the 
total scores from the two testing modes (PPT vs. CBT) were sufficiently 
comparable so that they conveyed the same meaning in the two testing 
modes. This was important to ensure because the two groups used in the 
DIF comparison are not equivalent groups and if the total scores did not 
convey the same meaning across the two testing modes, then an equating 
adjustment would be necessary. The d statistic calculated for the matched 
groups at the test level showed that there was no difference in mean 
performance in the two testing modes (Table 1). Furthermore,
the PPT and CBT score distributions for the matched groups were tested
using the Kolmogorov-Smirnov (KS) test to determine whether the two
distributions were significantly different for the reading, writing, and
mathematics tests. The maximum differences (also known as D) between the PPT
and CBT score distributions for the reading, writing, and mathematics
tests were 0.03, 0.05, and 0.04, respectively. These values are not statistically
significant (p > 0.01), suggesting that there is no statistically significant
difference in the PPT and CBT score distributions. Thus, it seemed
that a particular raw score conveyed the same meaning across the two 
testing modes, and therefore, the DIF analysis could be conducted using 
unmatched groups. 
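
To see why D values of this size are not significant, they can be compared with the usual large-sample approximation to the two-sample KS critical value, $D_{crit} = c(\alpha)\sqrt{(n_1 + n_2)/(n_1 n_2)}$ with $c(\alpha) = \sqrt{-\ln(\alpha/2)/2}$. The paper does not say how its p-values were obtained, so this sketch is only an approximate cross-check; the function names are hypothetical, and the sample sizes are the matched group sizes reported for the test-level analysis.

```python
import math

def ks_statistic(a, b):
    """Maximum absolute difference between two empirical CDFs."""
    points = sorted(set(a) | set(b))
    return max(
        abs(sum(x <= p for x in a) / len(a) - sum(x <= p for x in b) / len(b))
        for p in points
    )

def ks_critical(n1, n2, alpha):
    """Large-sample approximation to the two-sample KS critical value."""
    c_alpha = math.sqrt(-math.log(alpha / 2) / 2)
    return c_alpha * math.sqrt((n1 + n2) / (n1 * n2))

# Reported D and matched group size (same n in the PPT and CBT groups).
REPORTED = {"reading": (0.03, 1122), "writing": (0.05, 1050), "mathematics": (0.04, 1136)}

below_critical = {}
for test, (d_max, n) in REPORTED.items():
    crit = ks_critical(n, n, alpha=0.01)
    below_critical[test] = d_max < crit
    print(test, d_max, round(crit, 3), below_critical[test])
```

With roughly 1,100 examinees per group, the approximate critical value at the 0.01 level is close to 0.07, so all three reported D values fall below it.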
Table 1: Means, Standard Deviations and Effect Sizes for the Computerized
and Paper and Pencil Versions of the Certification Test

                 Reading              Writing              Mathematics
                 CBT       PPT        CBT       PPT        CBT       PPT
Mean             30.597    30.404     30.877    30.714     27.862    28.402
SD                6.719     7.207      6.911     7.221      7.502     7.729
Cohen's d             0.028                0.023               -0.071

Statistical significance of the d statistic was calculated using an unpaired t-test
*p < 0.05


For the substantive analyses, two test reviewers independently exam- 
ined the test items flagged as DIF by SIBTEST. Items with a C-level rating 
on both random sub-sample runs or a C-level rating for one random sub- 
sample run and a B-level rating for the other random sub-sample run were 
examined. This decision seems justified since C-level items are typically 
scrutinized for bias and content explanations for the DIF (Zieky, 1993; 
Dorans & Holland, 1993). The two test reviewers worked independently 
and generated substantive explanations about the causes of possible mode 
DIF. Once the independent reviews were completed, the two test reviewers 
met to discuss their decisions and reach consensus on the items where 
they disagreed. 
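
The selection rule for substantive review can be written as a one-line check on the pair of DIF levels from the two sub-sample runs. The sketch below is illustrative: the function name is hypothetical, and the levels are those reported for the three writing items in Table 2.

```python
def needs_review(level_run1, level_run2):
    """True when an item is C on both runs, or C on one run and B on the other."""
    pair = sorted((level_run1, level_run2))
    return pair == ["C", "C"] or pair == ["B", "C"]

# DIF levels for (sub-sample I, sub-sample II) as reported in Table 2.
TABLE_2_LEVELS = {1: ("C", "B"), 11: ("B", "C"), 15: ("C", "C")}

flagged = [item for item, levels in TABLE_2_LEVELS.items() if needs_review(*levels)]
print(flagged)
```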
Results

Statistical and substantive analyses were conducted at the test and 
item levels to evaluate the comparability of the paper-and-pencil and 
computerized versions of the certification tests. For the statistical analysis,
comparability of the paper-and-pencil and computerized versions of the
tests was evaluated using an effect size measure. Additionally, items on
the paper-and-pencil and computerized versions of the tests were evaluated
for DIF using SIBTEST across the two versions of the tests. For the
substantive analysis, two test reviewers examined the items that were statistically
flagged as DIF and generated substantive interpretations regarding
the cause of DIF on those items. 

Statistical Analysis 

The d statistic was calculated for scores obtained from the paper-and- 
pencil and computerized administrations of the reading, writing, and 
mathematics tests. The means and standard deviations for the paper-and-pencil
and computerized administrations for the three tests are similar
(Table 1). As seen in Table 1, the effect sizes for the PPT and
CBT comparisons for the reading, writing, and mathematics tests were
small (absolute values less than 0.20). Following Cohen’s (1988) criteria, these small
effect sizes indicate that the PPT and CBT versions of the reading, writing,
and mathematics tests are comparable. Furthermore, the d statistics computed
for the paper-and-pencil versus computerized versions of the tests
were not statistically significant (p > 0.05) for the reading, writing, and
mathematics tests, suggesting that the paper-and-pencil and computerized
versions of these tests do not show a statistically significant difference.

Furthermore, DIF analyses of the reading, writing, and mathematics
tests were conducted to identify items that functioned differentially
between the PPT and CBT versions of these tests. An initial concern was
the possibility of contamination of the matching subtest. For example, if
the first iteration of the DIF analysis indicates that several items display
DIF, then the matching subtest is likely to contain several DIF items
and is therefore contaminated. To overcome this problem,
a step-wise approach was used in which a DIF analysis was first conducted
for each item (using the remaining items as the matching criterion) and
items displaying large DIF were removed from the matching subtest. Then
the data were re-analyzed using the purer matching subtest (i.e., DIF free),
thereby leading to a more stable set of results. It should also be noted
that an initial DIF analysis for the writing test was conducted using all
45 items, in which the last five items showed DIF against the PPT group.
Since these five items were also identified as speeded for the PPT version
of the test in an earlier study (see Boughton, Yamamoto, & Larkin, 2003),
they were dropped from the current analysis, and the final DIF analysis for
the writing test was conducted with 40 instead of 45 items.
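
The step-wise purification of the matching subtest can be sketched in a few lines. This is not SIBTEST: for brevity the sketch swaps in the simpler standardization index (a focal-weighted difference in proportion correct at each matching score, with no regression correction), and the function names, the toy response data, and the use of .100 as the removal threshold are all assumptions made for illustration.

```python
from collections import defaultdict

def std_p_dif(ref, foc, item, matching):
    """Standardized p-difference (reference minus focal) for one item.

    Examinees are stratified by their number-correct score on the
    matching items, excluding the studied item itself.
    """
    def by_score(rows):
        cells = defaultdict(lambda: [0, 0])  # score -> [n examinees, n correct]
        for r in rows:
            s = sum(r[j] for j in matching if j != item)
            cells[s][0] += 1
            cells[s][1] += r[item]
        return cells

    r_cells, f_cells = by_score(ref), by_score(foc)
    num = den = 0.0
    for s, (nf, cf) in f_cells.items():
        if s in r_cells:  # only score levels observed in both groups
            nr, cr = r_cells[s]
            num += nf * (cr / nr - cf / nf)
            den += nf
    return num / den if den else 0.0

def purified_dif(ref, foc, n_items, large=0.100):
    """Two-pass DIF: drop large-DIF items from the matching set, then re-run."""
    all_items = list(range(n_items))
    first = {i: std_p_dif(ref, foc, i, all_items) for i in all_items}
    clean = [i for i in all_items if abs(first[i]) < large]
    second = {i: std_p_dif(ref, foc, i, clean) for i in all_items}
    return first, second, clean

# Toy 0/1 response data (rows = examinees, columns = items); the third
# item is built to favor the reference group.
ref = [[1, 1, 1], [1, 0, 1], [0, 1, 1], [0, 0, 0]]
foc = [[1, 1, 0], [1, 0, 0], [0, 1, 0], [0, 0, 0]]

first, second, clean = purified_dif(ref, foc, n_items=3)
print(clean, second)
```

In the toy data the third item is removed from the matching set after the first pass, and the second pass is then run against the two remaining items.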

For the reading test, no items were identified as showing DIF between
the PPT and CBT versions of the test. For the writing test, three items
(items 1, 11, and 15) were identified as showing DIF. Item 1 measured the
test taker’s ability to identify errors in adjectives, nouns, pronouns, and
verbs (i.e., grammatical errors). Items 11 and 15 measured the test taker’s
ability to identify errors in structural relationships such as errors in
comparison, negation, coordination, and correlation. These three items
favored examinees who took the PPT (Figure 1 and Table 2).
For the mathematics test, there were no items identified as showing
DIF. Results of the statistical analyses suggest that performance across the
PPT and CBT versions of the reading, writing, and mathematics tests is
comparable at the test level (i.e., d statistic results). However, DIF analyses
suggest that item-level differences exist across the PPT and CBT versions
of the writing test. Mode DIF was more prominent in the writing test than
in the reading and mathematics tests, in which there were no
C-level DIF items.

Figure 1: DIF Items for Paper and Pencil and Computerized Versions
of the Writing Test
Table 2: Results of SIBTEST DIF Analysis for the Computerized and Paper
and Pencil Versions of the Certification Test: Writing

Writing

          Random Sub-sample I              Random Sub-sample II
Item      $\hat{\beta}_{UNI}$    Level     $\hat{\beta}_{UNI}$    Level
1         -0.101*                C         -0.086*                B
11        -0.099*                B         -0.101*                C
15        -0.131*                C         -0.110*                C

A negative $\hat{\beta}_{UNI}$ favors Paper and Pencil test-takers
*p < 0.05


Substantive Analysis 

Statistical methods are useful in detecting DIF items; however, to 
understand the nature of mode DIF, two test reviewers provided substan- 
tive reviews for the items identified as DIF (see, for example, Camilli & 
Shepard, 1994, p. xiii; Gierl & Khaliq, 2001). Without substantive analysis, 
it would be difficult to know whether an item that is flagged as DIF is due 
to differences in the mode of administration of these tests (e.g., paper 
and pencil vs. computer) or due to actual ability differences between the 
examinees from the two populations (i.e., item impact). The test reviewers 
did not find any difference in the items flagged as DIF that may cause 
differential performance on these items across PPT and CBT formats. Also, 
since the effect size measure showed no difference in mean performance 
across the two testing modes and the KS-test showed no difference in the 
PPT and CBT score distributions, it is less likely that item impact can be 
the cause of DIF on these items. Therefore the cause of DIF on these items 
could not be identified. 

The test reviewers found an overall difference between items in PPT 
and CBT formats for the full test but they did not ascribe this difference 
to be a cause of DIF for items that were flagged statistically. As mentioned 
earlier, this overall difference was that the items in PPT format had the 
multiple-choice options clearly marked as A, B, C, D, or no error. However, 
in the CBT format, these options were not marked as A, B, C, D, or no 
error. For some items, where the task was to identify grammatical errors 
in written passages (item 1 belonged to this category), these options were 
presented as underlined texts in the passages and the examinees were 
instructed to click on the underlined text that denoted the correct option.
For the remaining items (items 11 and 15 belonged to this category), oval
icons were provided beside the options and the examinees were instructed
to click on the oval icon that denoted the correct option (Appendix A
shows hypothetical examples of this difference in PPT and CBT
formats). Because this difference was present for all the items in the test,
it would be erroneous to conclude that this difference resulted in DIF for
the three items that were flagged statistically but not for the remainder of 
the items on the test. 

Discussion and Conclusion 

The purpose of the present study was to evaluate the comparability 
of PPT and CBT versions of a certification test designed to measure basic 
proficiency in reading, writing, and mathematics. The effect size mea- 
sure (i.e., the d statistic) was used to evaluate the comparability of the 
PPT and CBT versions of the reading, writing, and mathematics tests. The 
effect sizes were not statistically significant for the reading, writing, and 
mathematics tests. The effect size measure was also used to obtain a more 
practical estimation of the magnitude of difference between the PPT and 
CBT versions of the tests. Evaluation of the effect sizes suggested that 
these tests were comparable across the PPT and CBT formats. It should be 
noted, however, that statistically significant results are not always practically
important and vice versa. In the current study, although the difference
between the standardized means for the PPT and CBT groups was
small, it may still affect the pass/fail status of a large number
of examinees, especially when the total samples are large. Therefore, considerable
thought by the testing program must go into determining the
practical implications of these results. 

Furthermore, the items were analyzed using SIBTEST to identify items 
that function differentially for examinees who took the tests in PPT or 
CBT formats. There were no items flagged as DIF for the reading and math- 
ematics tests. There were three items (all favoring PPT) that were identified 
as DIF for the writing test across the two random sub-samples that were 
analyzed. However, substantive reviews failed to identify any difference 
in format that could lead to DIF on these items. The authors suggest that 
the testing program should monitor these three items in future testing 
administrations, and if they continue to show DIF statistically, then they 
should be replaced with new items. 

This study was important, as it has been demonstrated by earlier 
research that administering tests in PPT and CBT formats may affect 
the comparability of scores obtained from these testing formats. Since the 
certification tests have extremely high stakes outcomes for test-takers, it 
was important to examine whether the tests yielded comparable scores
when administered in PPT or CBT formats. Furthermore, by examining 
item-level performance in addition to test level performance, this study 
provided an opportunity to review format differences at the item level. As 
Gallagher et al. (2002) pointed out, with an increase in familiarity of stu- 
dents with computers, an overall measure of difference in test performance 
due to change in mode of delivery may appear less meaningful today. Thus, 
it was important to use both statistical and substantive analyses at the 
test and item level in order to ensure that tests are fair and valid for all, 
regardless of mode of presentation. The findings of this study
were positive and suggested that the CBT and PPT versions of this certification
test are comparable.

Finally, the procedures used in the present study can be applied to 
other testing situations involving comparison of examinees' performance 
across different groups taking the same test in paper and pencil versus 
computer or Internet. The application of propensity matching to compare 
group differences is relatively new in educational testing. However, as 
pointed out by Yu, Livingston, Larkin, and Bonnet (2004), if considerable 
demographic information about the examinees is available, techniques 
using propensity score matching make it possible to sharpen any compari- 
sons of performance between examinee groups that vary systematically 
on certain background and demographic factors. 
Appendix A


This appendix provides two examples of hypothetical items in PPT and 
CBT format. In the first example, the task is to identify an error in the 
underlined and lettered (in case of PPT) portions of a written passage. 
In the second example, the task is to choose the correct answer from five 
possible options. 


Figure 2: Example 1: Identify an Error in a Written Passage

Item in PPT Format

Entire issues of several journals have been devote to the issue of comparability
of paper and pencil and computerized test delivery modes. No error
[In the PPT format, the underlined portions of the passage and "No error" are lettered A through E; the underlining was not preserved here.]

Item in CBT Format

Entire issues of several journals have been devote to the issue of comparability
of paper and pencil and computerized test delivery modes. No error
[In the CBT format, the same portions are underlined but not lettered, and the examinee clicks on one of them.]


Figure 3: Example 2: Choose the Correct Answer from Five Options

[Figure: the same item, "Who is the king of Mecopotania?", shown in PPT format with lettered options and in CBT format with an oval icon beside each option. The option texts recoverable from the figure are Charles, Popper, George, Victor, and Philips.]